Instructions on how to use an R/Bioconductor package “Pi”, a genetics-led system for the discovery and prioritisation of drug targets at the gene and pathway level. Taking as inputs disease-associated SNPs and their significance level (GWAS summary data) for specific immune traits, Pi uses genomic predictors to identify and score the likely modulated genes (termed seed genes) responsible for these genetic associations, on the basis of (i) genomic proximity of a gene to a trait-associated SNP (nGene); (ii) physical interaction between disease-associated SNPs and specific gene promoters evidenced by chromatin conformation (cGene) for primary immune cells; and (iii) modulation of gene expression (eGene) using expression quantitative trait loci (eQTL) mapped for major immune cell types in resting and activated states. Pi additionally uses annotation predictors comprising evidence for a gene with an immune function (fGene); immune phenotype (pGene); and rare genetic diseases related to immunity or the immune system (dGene). Pi also incorporates knowledge of gene interactions to iteratively explore network information and estimate connectivity (affinity scores) of a given gene product (network node) to a set of scored seed genes/nodes, identifying non-seed genes that lack genetic evidence but are highly ranked on the basis of network connectivity with prioritised seed genes. As such, a gene-predictor matrix that combines genomic and annotation predictors with knowledge of affinity scores is constructed from GWAS summary data. The prioritisation of target genes is enabled through meta-analysis (discovery mode) to prioritise target genes with substantial genetic support and/or with rich network connectivity. The prioritisation of target pathways individually and at crosstalk is based on the prioritised target genes.
Pi 1.7.3
Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritize drug targets at the gene and pathway level. Bioconductor (2018); doi:10.18129/B9.bioc.Pi
# stable release version from Bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Pi")
# development version from github
source("http://bioconductor.org/biocLite.R")
biocLite("devtools")
devtools::install_github("hfang-bristol/Pi")
We are grateful to have your feedbacks particularly bug reports. Please file issues here for any bugs or other technical matters.
Figure 1: Pi pipeline
Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (on genome-wide).
Figure 2: Pi functions
Designed in a nested way following this route: xPierSNPsAdv -> xPierSNPs -> xPierGenes -> xPier -> xRWR to prepare a gene-predict matrix and prioritisation of targe genes (xPierMatrix) taking as inputs disease-associated SNPs and their significance level (GWAS summary data) for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (xPierPathways) or pathway crosstalk identification (xPierSubnet)
Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi to achieve disease-specific prioritisation of drug targets at the gene and pathway level
First of all, load the package and specify the location of built-in data.
library(Pi)
RData.location <- "http://pi.well.ox.ac.uk/bigdata"
GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.
data.file <- "http://pi.well.ox.ac.uk/bigdata/GWAS_summary_data.RA.txt"
data <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)
| SNP | pvalue |
|---|---|
| rs998731 | 7.0e-09 |
| rs9979383 | 5.0e-10 |
| rs9826828 | 8.6e-10 |
| rs968567 | 1.8e-08 |
| rs9653442 | 1.6e-18 |
Parameters used
include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
eQTL.customised <- NULL
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.eQTL <- c("JKscience_CD14","JKscience_LPS2","JKscience_LPS24","JKscience_IFN","JKng_bcell","JKpg_CD4","JKpg_CD8","JKnc_neutro", "JK_nk", "WESTRAng_blood")
eQTL.customised <- NULL
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location) # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE
| Code | Name |
|---|---|
| JKscience_CD14 | CD14+ (monocytes) |
| JKscience_LPS2 | CD14+ by LPS2h |
| JKscience_LPS24 | CD14+ by LPS24h |
| JKscience_IFN | CD14+ by IFN24h |
| JKng_bcell | B cells |
| JKpg_CD4 | CD4+ T cells |
| JKpg_CD8 | CD8+ T cells |
| JKnc_neutro | neutrophils |
| JK_nk | NK cells |
| WESTRAng_blood | peripheral blood |
| Code | Name |
|---|---|
| Monocytes | monocytes |
| Macrophages_M0 | macrophages (M0) |
| Macrophages_M1 | macrophages (M1) |
| Macrophages_M2 | macrophages (M2) |
| Neutrophils | neutrophils |
| Naive_CD4_T_cells | CD4+ T cells (naïve) |
| Total_CD4_T_cells | CD4+ T cells (total) |
| Naive_CD8_T_cells | CD8+ T cells (naïve) |
| Total_CD8_T_cells | CD8+ T cells (total) |
| Naive_B_cells | B cells (naïve) |
| Total_B_cells | B cells (total) |
Prioritisation
# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdv(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, eQTL.customised=eQTL.customised, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iAnno <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)
iA <- iAnno[, 1:14]
ls_pNode_anno <- lapply(9:11, function(j){
data_anno <- data.frame(seed=iA$Symbol, weight=iA[,j], stringsAsFactors=F)
data_anno <- subset(data_anno, weight>0)
pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- paste("Annotation_", colnames(iA)[9:11], sep='')
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)
# predictor matrix
df_predictor <- xPierMatrix(ls_pNode, displayBy="score", combineBy="union", aggregateBy="none", RData.location=RData.location)
# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", combineBy="union", aggregateBy="fishers", RData.location=RData.location)
The results are stored in an object called dTarget, a list with components including a data frame priority.
dTarget
## An object of S3 class 'dTarget', with 3 components:
## $metag: an igraph object with 15430 nodes and 313918 edges
## $predictor: a data frame of 15430 rows X 25 columns
## $priority: a data frame of 15430 rows X 13 columns
##
## --------------------------------------------------
## $priority:
## name rank pvalue fdr priority
## HLA-DRB1 1 2.715127e-43 4.189441e-39 5.000000
## HLA-DQA1 2 2.003228e-41 1.545490e-37 4.889061
## description Overall OMIM
## major histocompatibility complex, class II, DR beta 1 4 1
## major histocompatibility complex, class II, DQ alpha 1 5 1
## Phenotype Function nearbyGenes eQTL HiC
## 0 0 1 6 7
## 1 1 1 0 9
## ......
The data frame dTarget$priority can be written into a text file Pi_output_priority.txt:
write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)
Manhattan plot
gp_manhattan <- xPierManhattan(dTarget, color='ggplot2', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.scale="normal", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan
Figure 3: Manhattan plot
Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named
Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of pathways (sourced from KEGG and REACTOME).
priority.top <- 150
eTerm1 <- xPierPathways(dTarget, priority.top=priority.top, ontology="KEGGorganismal", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
eTerm2 <- xPierPathways(dTarget, priority.top=priority.top, ontology="KEGGenvironmental", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
list_eTerm <- list(eTerm1, eTerm2)
names(list_eTerm) <- c('Organismal systems','Environmental information processing')
gp_kegg <- xEnrichCompare(list_eTerm, displayBy="zscore", FDR.cutoff=0.01, facet=F, signature=F)
gp_kegg
Figure 4: Barplot of prioritised KEGG pathways
priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
gp_reactome <- xEnrichForest(eTerm, top_num=10, FDR.cutoff=0.01, CI.one=F, zlim=c(0,30), colormap="ggplot2.top", wrap.width=50, signature=F)
gp_reactome
Figure 5: Forest plot of prioritised REACTOME pathways
An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritized genes with a few less prioritized genes as linkers.
# obtain a gene network merged from KEGG pathways
networks <- c("KEGG_environmental","KEGG_organismal")
ls_ig <- lapply(networks, function(network){
g <- xDefineNet(network=network, weighted=FALSE, verbose=FALSE, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig, combineBy='union', attrBy="intersect", verbose=TRUE)
# search for a subset (with a desired node number of ~30) of the gene network
g <- xPierSubnet(dTarget, network.customised=ig, priority.quantile=0.1, subnet.size=30, RData.location=RData.location)
# visualise pathway crosstalk with nodes colored by priority rating and labelled with evidence (seed genes)
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence
Figure 6: Visualisation of pathway crosstalk
Nodes/genes colored by priority rating and labelled with evidence (seed genes; colored segments)
Here is the output of sessionInfo() on the system on which this user manual was compiled:
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] bindrcpp_0.2 png_0.1-7 BiocStyle_2.7.8 Pi_1.7.3
## [5] XGR_1.1.4 ggplot2_2.2.1 dnet_1.1.4 supraHex_1.13.3
## [9] hexbin_1.27.1 igraph_1.1.2
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.3 RSQLite_2.0
## [3] AnnotationDbi_1.40.0 htmlwidgets_0.9
## [5] BiocParallel_1.12.0 munsell_0.4.3
## [7] units_0.4-6 codetools_0.2-15
## [9] misc3d_0.8-4 withr_2.1.0
## [11] colorspace_1.3-2 BiocInstaller_1.28.0
## [13] Biobase_2.38.0 OrganismDbi_1.20.0
## [15] highr_0.6 knitr_1.20
## [17] rstudioapi_0.7 stats4_3.4.3
## [19] ROCR_1.0-7 robustbase_0.92-8
## [21] dimRed_0.1.0 labeling_0.3
## [23] GenomeInfoDbData_0.99.1 mnormt_1.5-5
## [25] bit64_0.9-7 rprojroot_1.2
## [27] xfun_0.1 ipred_0.9-6
## [29] biovizBase_1.26.0 randomForest_4.6-12
## [31] R6_2.2.2 doParallel_1.0.11
## [33] GenomeInfoDb_1.14.0 AnnotationFilter_1.2.0
## [35] DRR_0.0.2 bitops_1.0-6
## [37] reshape_0.8.7 DelayedArray_0.4.1
## [39] assertthat_0.2.0 scales_0.5.0
## [41] nnet_7.3-12 gtable_0.2.0
## [43] ddalpha_1.3.1 ggbio_1.26.1
## [45] ensembldb_2.2.0 timeDate_3042.101
## [47] rlang_0.1.6 CVST_0.2-1
## [49] RcppRoll_0.2.2 splines_3.4.3
## [51] rtracklayer_1.38.1 lazyeval_0.2.1
## [53] ModelMetrics_1.1.0 acepack_1.4.1
## [55] dichromat_2.0-0 broom_0.4.3
## [57] checkmate_1.8.5 intergraph_2.0-2
## [59] yaml_2.1.18 reshape2_1.4.2
## [61] GenomicFeatures_1.30.0 ggnetwork_0.5.1
## [63] backports_1.1.1 httpuv_1.3.5
## [65] Hmisc_4.0-3 RMySQL_0.10.13
## [67] RBGL_1.54.0 caret_6.0-77
## [69] tools_3.4.3 lava_1.5.1
## [71] bookdown_0.7 psych_1.7.8
## [73] statnet.common_4.0.0 gplots_3.0.1
## [75] RColorBrewer_1.1-2 BiocGenerics_0.24.0
## [77] Rcpp_0.12.15 plyr_1.8.4
## [79] base64enc_0.1-3 progress_1.1.2
## [81] zlibbioc_1.24.0 purrr_0.2.4
## [83] RCurl_1.95-4.8 prettyunits_1.0.2
## [85] deldir_0.1-14 rpart_4.1-11
## [87] S4Vectors_0.16.0 sfsmisc_1.1-1
## [89] SummarizedExperiment_1.8.0 ggrepel_0.7.0
## [91] cluster_2.0.6 magrittr_1.5
## [93] data.table_1.10.4-3 sna_2.4
## [95] ProtGenerics_1.10.0 matrixStats_0.52.2
## [97] evaluate_0.10.1 mime_0.5
## [99] xtable_1.8-2 XML_3.98-1.9
## [101] IRanges_2.12.0 gridExtra_2.3
## [103] compiler_3.4.3 biomaRt_2.34.0
## [105] tibble_1.4.2 KernSmooth_2.23-15
## [107] htmltools_0.3.6 Formula_1.2-2
## [109] udunits2_0.13 tidyr_0.7.2
## [111] lubridate_1.7.1 DBI_0.7
## [113] tweenr_0.1.5 MASS_7.3-47
## [115] Matrix_1.2-12 gdata_2.18.0
## [117] parallel_3.4.3 Gviz_1.22.1
## [119] bindr_0.1 gower_0.1.2
## [121] GenomicRanges_1.30.0 pkgconfig_2.0.1
## [123] GenomicAlignments_1.14.1 RCircos_1.2.0
## [125] foreign_0.8-69 recipes_0.1.1
## [127] foreach_1.4.3 XVector_0.18.0
## [129] prodlim_1.6.1 stringr_1.3.0
## [131] VariantAnnotation_1.24.2 digest_0.6.12
## [133] graph_1.56.0 Biostrings_2.46.0
## [135] rmarkdown_1.9 htmlTable_1.11.0
## [137] curl_3.0 kernlab_0.9-25
## [139] shiny_1.0.5 Rsamtools_1.30.0
## [141] gtools_3.5.0 nlme_3.1-131
## [143] network_1.13.0 BSgenome_1.46.0
## [145] pillar_1.1.0 lattice_0.20-35
## [147] GGally_1.3.2 httr_1.3.1
## [149] DEoptimR_1.0-8 survival_2.41-3
## [151] interactiveDisplayBase_1.16.0 glue_1.2.0
## [153] iterators_1.0.8 plot3D_1.1.1
## [155] glmnet_2.0-13 bit_1.1-12
## [157] Rgraphviz_2.22.0 ggforce_0.1.1
## [159] class_7.3-14 stringi_1.1.7
## [161] blob_1.1.0 AnnotationHub_2.10.1
## [163] latticeExtra_0.6-28 caTools_1.17.1
## [165] memoise_1.1.0 dplyr_0.7.4
## [167] ape_5.0